[Doc][Train] Add accelerator_type to Ray Train user guide #44882
Conversation
Signed-off-by: Hongpeng Guo <[email protected]>
Tip if you haven't seen this already: we build the docs as part of the premerge CI, so you can take a look at your rendered docs: https://anyscale-ray--44882.com.readthedocs.build/en/44882/index.html

Nice tips! ty
Nice! Made some edit suggestions.
Sometimes you might want to specify the accelerator type for a worker. For example,
you can specify `accelerator_type="A100"` in the `ScalingConfig` if you want to
assign the worker an NVIDIA A100 GPU.
Suggested change:

Ray Train allows you to specify the accelerator type for each worker.
This is useful if your model training has some GPU memory constraints that require a specific type of GPU.
In a heterogeneous Ray cluster, this means that your training workers will be forced to run on the specified GPU type, rather than on any arbitrary GPU node.
For example, you can specify `accelerator_type="A100"` in the :class:`~ray.train.ScalingConfig` if you want to
assign each worker an NVIDIA A100 GPU.
```python
import torch

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device


def train_func():
    assert torch.cuda.is_available()

    device = get_device()
    assert device == torch.device("cuda:0")


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        accelerator_type="A100",
    ),
)
trainer.fit()
```
We can cut this down to just show the `ScalingConfig`.
```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        accelerator_type="A100",
    ),
)
```
Oh, for this one, I'm thinking of just showing `ScalingConfig(...)`.
Ensure that your cluster has instances with the specified accelerator type
or is able to autoscale to fulfill the request.
We can make this a tip:

```rst
.. tip::

    Ensure that your cluster has instances with the specified accelerator type
    or is able to autoscale to fulfill the request.
    Otherwise, your job will hang forever due to unsatisfiable pending resource requests.
```
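One way to check this up front, rather than discovering it from a hung job: Ray exposes each node's accelerator as a cluster resource, so you can inspect the connected nodes before calling `trainer.fit()`. A minimal sketch, assuming the `accelerator_type:<TYPE>` resource key format that Ray uses to label accelerator nodes:

```python
import ray

ray.init()

# Nodes with a detected accelerator carry a resource key of the form
# "accelerator_type:<TYPE>" (e.g. "accelerator_type:A100"); this key format
# is an assumption about how Ray labels accelerator nodes.
available = ray.cluster_resources()
if "accelerator_type:A100" not in available:
    print(
        "No A100 nodes connected yet; without autoscaling, "
        "the training job would hang in pending."
    )
```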
Oh nice tip structure. Will try.
Signed-off-by: Hongpeng Guo <[email protected]>
…accelerator-type
Signed-off-by: Hongpeng Guo <[email protected]>
```python
ScalingConfig(
    num_workers=1,
    use_gpu=True,
    accelerator_type="A100",
)
```
Fix the indent here?
```rst
Setting the GPU type
~~~~~~~~~~~~~~~~~~~~

Ray Train allows you to specify the accelerator type for each worker.
This is useful if your model training has some GPU memory constraints that require a specific type of GPU.
```
Users may want to use different accelerator types not only for GPU memory constraints, but also for, e.g., compute power, cost efficiency, availability, etc.
Let's just say: "This is useful if you want to use a specific accelerator type for model training."
Signed-off-by: Hongpeng Guo <[email protected]>
Signed-off-by: Hongpeng Guo <[email protected]>
Signed-off-by: Hongpeng Guo <[email protected]>
Nice work!
Nice!
Changed the title: "[Doc][Train] Add accelerator_type to Ray Train user guides" → "[Doc][Train] Add accelerator_type to Ray Train user guide"
Why are these changes needed?

Our `ScalingConfig()` function supports a new argument `accelerator_type`. This PR provides a user guide with example code to showcase the usage. The generated section of the user guide is appended below.
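For quick reference, a minimal sketch of the new argument, condensed from the examples discussed in the review above:

```python
from ray.train import ScalingConfig

# Request one GPU worker, constrained to run on an NVIDIA A100 node.
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    accelerator_type="A100",
)
```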
"Closes #44763
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.